feat: Add Hook Level Lineage to SQL hooks #61535
CI currently fails :(
Thanks, I'll work on the fix; some rowcount mocks are probably missing from the new tests.
I think the |
@potiuk I see that the script updating the sql stubs was removed when moving to new repo structure in #45964. The |
Pinging you since you originally created the script a long time ago 😄 and later moved us to the new structure; maybe uv handles it in some way that I'm not aware of.
I'm just wondering, and I'm not sure it's possible, but since lineage is a cross-cutting concern, wouldn't a decorator be a viable solution here?
Yes, lineage as a cross-cutting concern is a common concept, and there are multiple ways to implement it. In this case of Hook Level Lineage, we can send assets directly via A decorator could work, but I think the current approach is clearer and more explicit. Given that this helper is limited to SQL-based hooks, there's no strong need for a global decorator in my opinion. If you have an example of what it could look like that would be easier to implement or read, let me know; I'm open to changing the approach. I just feel there is not much to gain from a decorator here: it would be the same code, just implemented differently, and with an explicit call we have full control over when we execute it - after cursor execution, before the connection is closed, and e.g. just once for
Thank you for your reply and explanation. Indeed, it's not as easy as it sounds. I certainly would not change the current implementation, as I like the work you did on this PR. This was just a thought that we might revisit in the future to make lineage more available out of the box. I wasn't thinking of all hooks in general, only of those based on DbApiHook, which would already be a challenge.
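One way to picture the timing argument from this exchange is the sketch below. It contrasts a decorator that wraps a whole method with an explicit call placed inside it. Everything here (`FakeConnection`, `report`, `with_lineage`, `run_decorated`, `run_explicit`) is a hypothetical stand-in written for illustration, not code from this PR; it assumes the wrapped method closes its connection before returning.

```python
import functools

# Each entry records the SQL and whether the connection was still open
# at the moment lineage was reported.
events = []


class FakeConnection:
    """Hypothetical connection that only tracks open/closed state."""

    def __init__(self):
        self.open = True

    def close(self):
        self.open = False


def report(conn, sql):
    """Hypothetical lineage report; notes the connection state it saw."""
    events.append((sql, conn.open))


def with_lineage(func):
    """Decorator variant: reports only after the wrapped call has returned."""

    @functools.wraps(func)
    def wrapper(conn, sql):
        result = func(conn, sql)
        report(conn, sql)  # the wrapped function has already closed conn
        return result

    return wrapper


@with_lineage
def run_decorated(conn, sql):
    try:
        return f"executed {sql}"
    finally:
        conn.close()


def run_explicit(conn, sql):
    try:
        result = f"executed {sql}"
        report(conn, sql)  # after execution, before the connection is closed
        return result
    finally:
        conn.close()


run_decorated(FakeConnection(), "SELECT 1")
run_explicit(FakeConnection(), "SELECT 2")
```

In this toy setup the decorator sees a closed connection while the explicit call still has it open, which is the kind of control over timing the reply refers to. A decorator could be written to avoid this, at the cost of restructuring the wrapped method.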
providers/common/sql/src/airflow/providers/common/sql/hooks/lineage.py
providers/apache/hive/src/airflow/providers/apache/hive/hooks/hive.py
Add hook-level lineage (HLL) reporting to SQL hooks via send_sql_hook_lineage
This PR introduces a standardized mechanism for SQL hooks to report execution metadata - SQL text, query parameters, job IDs, row counts, default database/schema - to the hook lineage collector using add_extra.
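As a rough illustration of the mechanism described above, the sketch below gathers execution metadata from a DB-API cursor and hands it to a collector through an `add_extra`-style call. `RecordingCollector`, `report_sql_lineage`, the payload keys, and the `add_extra(context, key, value)` signature are assumptions made for this example; they are not the actual `send_sql_hook_lineage` implementation or the Airflow collector API. SQLite is used only so the example runs without a database server.

```python
import sqlite3
from typing import Any, Optional


class RecordingCollector:
    """Hypothetical stand-in for a hook lineage collector."""

    def __init__(self) -> None:
        self.extras: list[dict[str, Any]] = []

    def add_extra(self, context: Any, key: str, value: Any) -> None:
        # Assumed signature: the reporting object, a key, and a payload.
        self.extras.append({"context": context, "key": key, "value": value})


def report_sql_lineage(
    collector: RecordingCollector,
    context: Any,
    sql: str,
    cursor: Any,
    sql_parameters: Optional[Any] = None,
    default_db: Optional[str] = None,
) -> None:
    """Hypothetical helper: build a metadata payload and send it as an extra."""
    payload: dict[str, Any] = {"sql": sql, "row_count": cursor.rowcount}
    if sql_parameters is not None:
        payload["sql_parameters"] = sql_parameters
    if default_db is not None:
        payload["default_db"] = default_db
    collector.add_extra(context, "sql_job", payload)


collector = RecordingCollector()
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (id INTEGER)")

insert_sql = "INSERT INTO t (id) VALUES (?)"
cur.execute(insert_sql, (1,))
# Report after execution, while the cursor and connection are still open.
report_sql_lineage(
    collector,
    context="demo-hook",
    sql=insert_sql,
    cursor=cur,
    sql_parameters=(1,),
    default_db="main",
)
cur.close()
conn.close()
```

The helper only forwards what the caller already has; parsing the SQL for input and output datasets is left to whichever collector is registered.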
I also bumped the required sql-common version for all modified providers so that HLL is actually emitted.
I've also added tests for most hooks that use DbApiHook as a base class, to make sure that Hook Level Lineage is still sent even if some methods are overridden in the future. For now we are mostly testing the DbApiHook implementation multiple times, but if some database hook decides to override run(), I need my test to fail so that the new implementation also calls the HLL collector.
Important context
The HLL collector is a no-op unless a collector is registered (e.g. by the OpenLineage provider). This means no runtime overhead for users who don't use lineage collection.
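The no-op behaviour can be pictured with the sketch below: a lookup function returns a collector whose methods do nothing unless a real one has been registered. `NoOpCollector`, `ListCollector`, `register_collector`, and `get_collector` are hypothetical names for this example and do not describe Airflow's own registration mechanism.

```python
from typing import Any, Optional


class NoOpCollector:
    """Accepts the same calls as a real collector but discards them."""

    def add_extra(self, context: Any, key: str, value: Any) -> None:
        return None


class ListCollector:
    """Hypothetical real collector that keeps what it receives."""

    def __init__(self) -> None:
        self.extras: list[tuple[str, Any]] = []

    def add_extra(self, context: Any, key: str, value: Any) -> None:
        self.extras.append((key, value))


_registered: Optional[Any] = None


def register_collector(collector: Any) -> None:
    """Hypothetical registration hook, e.g. called by a lineage provider."""
    global _registered
    _registered = collector


def get_collector() -> Any:
    # Fall back to the no-op collector when nothing has been registered.
    return _registered if _registered is not None else NoOpCollector()


# Nothing registered yet: the call is accepted and ignored.
was_noop = isinstance(get_collector(), NoOpCollector)
get_collector().add_extra("hook", "sql", "SELECT 1")

# After registration the same call site starts recording.
real = ListCollector()
register_collector(real)
get_collector().add_extra("hook", "sql", "SELECT 1")
```

Hook code calls `get_collector().add_extra(...)` unconditionally in both cases, so callers need no "is lineage enabled" checks.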
Motivation
Black-box operators (e.g. PythonOperator calling PostgresHook.run(sql)) currently produce no lineage. With this change, any registered collector can capture the SQL being executed, parse it for input/output datasets, and attach query IDs to lineage events - dramatically improving lineage quality without requiring operator-level changes.
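To make the black-box case concrete, the sketch below has a plain Python callable, of the kind a PythonOperator might invoke, call `run(sql)` on a hook-like object. Lineage is captured inside `run`, so the callable itself needs no changes. `DemoSqlHook`, `my_task`, and the list used as a collector are hypothetical stand-ins, not PostgresHook or DbApiHook, and SQLite replaces a real database for runnability.

```python
import sqlite3


class DemoSqlHook:
    """Hypothetical hook-like class that reports lineage from run()."""

    def __init__(self, collected: list) -> None:
        self.collected = collected
        self.conn = sqlite3.connect(":memory:")

    def run(self, sql: str) -> None:
        cur = self.conn.cursor()
        cur.execute(sql)
        # Report while the cursor is still open so rowcount is available.
        self.collected.append({"sql": sql, "row_count": cur.rowcount})
        cur.close()
        self.conn.commit()

    def close(self) -> None:
        self.conn.close()


captured: list = []


def my_task(hook: DemoSqlHook) -> None:
    """User code with no lineage awareness; it just runs SQL."""
    hook.run("CREATE TABLE orders (id INTEGER)")
    hook.run("INSERT INTO orders (id) VALUES (1)")
    hook.close()


my_task(DemoSqlHook(captured))
```

After `my_task` finishes, `captured` holds one entry per statement, which a registered collector could then parse for input and output datasets.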
Follow-up PRs
Was generative AI tooling used to co-author this PR?
Co-authored by: Cursor following the guidelines